UNVEILING INSIGHTS FROM THE FIFA WORLD CUP DATA¶

THE DATASET & GOAL OF THE ANALYSIS¶

This project delves into the captivating world of the FIFA World Cup using an extensive dataset sourced from the FIFA World Cup Archive. The dataset, available at the following link, encompasses information about every World Cup held annually from 1930 up to 2014 (except during 1942 and 1946 due to the Second World War). It consists of three primary components:

  1. 'World Cups' Dataset: Provides details about each tournament, including host countries, winning teams, goals scored, and participating nations.
  2. 'World Cup Matches' Dataset: Records match information such as dates, stages, teams, goals, and referees.
  3. 'Players' Dataset: Introduces individual players, their teams, positions, and notable events during matches.

The Goal¶

This project has two main objectives: to gain insights through SQL analysis and present these insights through visualizations. By using SQL queries, the aim is to uncover interesting stories, trends, and facts about the World Cup. Answering questions about team performances, match dynamics, and standout players.

To make these insights accessible and engaging, data visualization techniques will be used. The visualizations will showcase match outcomes, team achievements, and player contributions in an easy-to-understand manner.

LIBRARIES¶

In [1]:
#!pip install psycopg2-binary
In [2]:
import pandas as pd
from IPython.display import Image, display
from PIL import Image as PILImage
from sqlalchemy import create_engine, text
import psycopg2
import plotly.express as px
import plotly.graph_objects as go
import dash
from dash import dcc, html 
from plotly.subplots import make_subplots
import plotly.io as pio

DATA CLEANING & EXPLORATION¶

In [3]:
#Let's start by importing the csv files 
world_cup = pd.read_csv("WorldCups.csv")
matches = pd.read_csv("WorldCupMatches.csv")
players = pd.read_csv("WorldCupPlayers.csv")

Next, we will take a quick look at each dataset

EXPLORING & CLEANING DATA¶

World Cup Dataset¶

In [4]:
#First rows of World Cup dataset
world_cup.head()
Out[4]:
Year Country Winner Runners-Up Third Fourth GoalsScored QualifiedTeams MatchesPlayed Attendance
0 1930 Uruguay Uruguay Argentina USA Yugoslavia 70 13 18 590.549
1 1934 Italy Italy Czechoslovakia Germany Austria 70 16 17 363.000
2 1938 France Italy Hungary Brazil Sweden 84 15 18 375.700
3 1950 Brazil Uruguay Brazil Sweden Spain 88 13 22 1.045.246
4 1954 Switzerland Germany FR Hungary Austria Uruguay 140 16 26 768.607
In [5]:
#Last rows of World Cup dataset 
world_cup.tail()
Out[5]:
Year Country Winner Runners-Up Third Fourth GoalsScored QualifiedTeams MatchesPlayed Attendance
15 1998 France France Brazil Croatia Netherlands 171 32 64 2.785.100
16 2002 Korea/Japan Brazil Germany Turkey Korea Republic 161 32 64 2.705.197
17 2006 Germany Italy France Germany Portugal 147 32 64 3.359.439
18 2010 South Africa Spain Netherlands Germany Uruguay 145 32 64 3.178.856
19 2014 Brazil Germany Argentina Netherlands Brazil 171 32 64 3.386.810
In [6]:
#NAs in the World Cup dataset 
missing_values_wc = world_cup.isna().sum()
print(missing_values_wc)
Year              0
Country           0
Winner            0
Runners-Up        0
Third             0
Fourth            0
GoalsScored       0
QualifiedTeams    0
MatchesPlayed     0
Attendance        0
dtype: int64
In [7]:
# Tranform Attendance into integer (remove dots)
world_cup['Attendance'] = world_cup['Attendance'].str.replace('.', '').astype(int)
/var/folders/h9/t189rfrn65l6ph7vmp0hyf7r0000gn/T/ipykernel_24421/3236473007.py:2: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
  world_cup['Attendance'] = world_cup['Attendance'].str.replace('.', '').astype(int)
In [8]:
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
world_cup.columns = world_cup.columns.str.replace(r'\W+', '_', regex=True).str.lower()
world_cup.head()
Out[8]:
year country winner runners_up third fourth goalsscored qualifiedteams matchesplayed attendance
0 1930 Uruguay Uruguay Argentina USA Yugoslavia 70 13 18 590549
1 1934 Italy Italy Czechoslovakia Germany Austria 70 16 17 363000
2 1938 France Italy Hungary Brazil Sweden 84 15 18 375700
3 1950 Brazil Uruguay Brazil Sweden Spain 88 13 22 1045246
4 1954 Switzerland Germany FR Hungary Austria Uruguay 140 16 26 768607

Matches Dataset¶

In [9]:
#First rows of Matches dataset
matches.head()
Out[9]:
Year Datetime Stage Stadium City Home Team Name Home Team Goals Away Team Goals Away Team Name Win conditions Attendance Half-time Home Goals Half-time Away Goals Referee Assistant 1 Assistant 2 RoundID MatchID Home Team Initials Away Team Initials
0 1930.0 13 Jul 1930 - 15:00 Group 1 Pocitos Montevideo France 4.0 1.0 Mexico 4444.0 3.0 0.0 LOMBARDI Domingo (URU) CRISTOPHE Henry (BEL) REGO Gilberto (BRA) 201.0 1096.0 FRA MEX
1 1930.0 13 Jul 1930 - 15:00 Group 4 Parque Central Montevideo USA 3.0 0.0 Belgium 18346.0 2.0 0.0 MACIAS Jose (ARG) MATEUCCI Francisco (URU) WARNKEN Alberto (CHI) 201.0 1090.0 USA BEL
2 1930.0 14 Jul 1930 - 12:45 Group 2 Parque Central Montevideo Yugoslavia 2.0 1.0 Brazil 24059.0 2.0 0.0 TEJADA Anibal (URU) VALLARINO Ricardo (URU) BALWAY Thomas (FRA) 201.0 1093.0 YUG BRA
3 1930.0 14 Jul 1930 - 14:50 Group 3 Pocitos Montevideo Romania 3.0 1.0 Peru 2549.0 1.0 0.0 WARNKEN Alberto (CHI) LANGENUS Jean (BEL) MATEUCCI Francisco (URU) 201.0 1098.0 ROU PER
4 1930.0 15 Jul 1930 - 16:00 Group 1 Parque Central Montevideo Argentina 1.0 0.0 France 23409.0 0.0 0.0 REGO Gilberto (BRA) SAUCEDO Ulises (BOL) RADULESCU Constantin (ROU) 201.0 1085.0 ARG FRA
In [10]:
#Last rows of Matches dataset 
matches.tail()
Out[10]:
Year Datetime Stage Stadium City Home Team Name Home Team Goals Away Team Goals Away Team Name Win conditions Attendance Half-time Home Goals Half-time Away Goals Referee Assistant 1 Assistant 2 RoundID MatchID Home Team Initials Away Team Initials
4567 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4568 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4569 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4570 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4571 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
In [11]:
#NAs in the Matches dataset 
missing_values_matches = matches.isna().sum()
print(missing_values_matches)
Year                    3720
Datetime                3720
Stage                   3720
Stadium                 3720
City                    3720
Home Team Name          3720
Home Team Goals         3720
Away Team Goals         3720
Away Team Name          3720
Win conditions          3720
Attendance              3722
Half-time Home Goals    3720
Half-time Away Goals    3720
Referee                 3720
Assistant 1             3720
Assistant 2             3720
RoundID                 3720
MatchID                 3720
Home Team Initials      3720
Away Team Initials      3720
dtype: int64
In [12]:
#Drop records where Year is missing  
matches.dropna(subset=['Year'], inplace=True)
In [13]:
#Which countries have the most World Cup participations
matches['Home Team Name'].value_counts()
Out[13]:
Brazil                        82
Italy                         57
Argentina                     54
Germany FR                    43
England                       35
                              ..
Wales                          1
Norway                         1
rn">United Arab Emirates       1
Haiti                          1
rn">Bosnia and Herzegovina     1
Name: Home Team Name, Length: 78, dtype: int64

Some countries have incorrect formatting containing rn". We will be cleaning this

In [14]:
#Which stadiums have seen the most World CUp matches
matches['Stadium'].value_counts()
Out[14]:
Estadio Azteca             19
Jalisco                    14
Olympiastadion             14
Nou Camp - Estadio Le�n    11
Estadio Centenario         10
                           ..
Comunale di Cornaredo       1
Rimnersvallen               1
Tunavallen                  1
White City                  1
Eyravallen                  1
Name: Stadium, Length: 181, dtype: int64
In [15]:
# Identify stadiums with special characters that are not read correctly
matches[matches['Stadium'].str.contains('�', na=False)]['Stadium'].value_counts()
Out[15]:
Nou Camp - Estadio Le�n                      11
Stade V�lodrome                               9
Maracan� - Est�dio Jornalista M�rio Filho     8
Estadio Ol�mpico Chateau Carreras             8
Estadio Jos� Mar�a Minella                    6
Estadio Ol�mpico Universitario                4
Estadio Municipal de Bala�dos                 3
Name: Stadium, dtype: int64
In [16]:
# Check for duplicates based on the 'matchid' column
duplicates = matches[matches.duplicated(subset=['MatchID'], keep=False)]

# Drop duplicates based on the 'matchid' column
matches = matches.drop_duplicates(subset=['MatchID'])
In [17]:
# Corrections for stadiums with special characters
corrections_stadiums = {
    'Nou Camp - Estadio Le�n': 'Nou Camp - Estadio Leon',
    'Stade V�lodrome': 'Stade Velodrome',
    'Maracan� - Est�dio Jornalista M�rio Filho': 'Maracana - Estadio Jornalista Mario Filho',
    'Estadio Ol�mpico Chateau Carreras': 'Estadio Olimpico Chateau Carreras',
    'Estadio Jos� Mar�a Minella': 'Estadio Jose Maria Minella',
    'Estadio Ol�mpico Universitario': 'Estadio Olimpico Universitario',
    'Estadio Municipal de Bala�dos': 'Estadio Municipal de Balaidos'
}

# Apply corrections to the 'Stadium' column
matches['Stadium'] = matches['Stadium'].replace(corrections_stadiums)
In [18]:
#Remove decimal from year column 
matches['Year'] = matches['Year'].astype(int)
In [19]:
matches.head()
Out[19]:
Year Datetime Stage Stadium City Home Team Name Home Team Goals Away Team Goals Away Team Name Win conditions Attendance Half-time Home Goals Half-time Away Goals Referee Assistant 1 Assistant 2 RoundID MatchID Home Team Initials Away Team Initials
0 1930 13 Jul 1930 - 15:00 Group 1 Pocitos Montevideo France 4.0 1.0 Mexico 4444.0 3.0 0.0 LOMBARDI Domingo (URU) CRISTOPHE Henry (BEL) REGO Gilberto (BRA) 201.0 1096.0 FRA MEX
1 1930 13 Jul 1930 - 15:00 Group 4 Parque Central Montevideo USA 3.0 0.0 Belgium 18346.0 2.0 0.0 MACIAS Jose (ARG) MATEUCCI Francisco (URU) WARNKEN Alberto (CHI) 201.0 1090.0 USA BEL
2 1930 14 Jul 1930 - 12:45 Group 2 Parque Central Montevideo Yugoslavia 2.0 1.0 Brazil 24059.0 2.0 0.0 TEJADA Anibal (URU) VALLARINO Ricardo (URU) BALWAY Thomas (FRA) 201.0 1093.0 YUG BRA
3 1930 14 Jul 1930 - 14:50 Group 3 Pocitos Montevideo Romania 3.0 1.0 Peru 2549.0 1.0 0.0 WARNKEN Alberto (CHI) LANGENUS Jean (BEL) MATEUCCI Francisco (URU) 201.0 1098.0 ROU PER
4 1930 15 Jul 1930 - 16:00 Group 1 Parque Central Montevideo Argentina 1.0 0.0 France 23409.0 0.0 0.0 REGO Gilberto (BRA) SAUCEDO Ulises (BOL) RADULESCU Constantin (ROU) 201.0 1085.0 ARG FRA
In [20]:
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
matches.columns = matches.columns.str.replace(r'\W+', '_', regex=True).str.lower()
matches.head()
Out[20]:
year datetime stage stadium city home_team_name home_team_goals away_team_goals away_team_name win_conditions attendance half_time_home_goals half_time_away_goals referee assistant_1 assistant_2 roundid matchid home_team_initials away_team_initials
0 1930 13 Jul 1930 - 15:00 Group 1 Pocitos Montevideo France 4.0 1.0 Mexico 4444.0 3.0 0.0 LOMBARDI Domingo (URU) CRISTOPHE Henry (BEL) REGO Gilberto (BRA) 201.0 1096.0 FRA MEX
1 1930 13 Jul 1930 - 15:00 Group 4 Parque Central Montevideo USA 3.0 0.0 Belgium 18346.0 2.0 0.0 MACIAS Jose (ARG) MATEUCCI Francisco (URU) WARNKEN Alberto (CHI) 201.0 1090.0 USA BEL
2 1930 14 Jul 1930 - 12:45 Group 2 Parque Central Montevideo Yugoslavia 2.0 1.0 Brazil 24059.0 2.0 0.0 TEJADA Anibal (URU) VALLARINO Ricardo (URU) BALWAY Thomas (FRA) 201.0 1093.0 YUG BRA
3 1930 14 Jul 1930 - 14:50 Group 3 Pocitos Montevideo Romania 3.0 1.0 Peru 2549.0 1.0 0.0 WARNKEN Alberto (CHI) LANGENUS Jean (BEL) MATEUCCI Francisco (URU) 201.0 1098.0 ROU PER
4 1930 15 Jul 1930 - 16:00 Group 1 Parque Central Montevideo Argentina 1.0 0.0 France 23409.0 0.0 0.0 REGO Gilberto (BRA) SAUCEDO Ulises (BOL) RADULESCU Constantin (ROU) 201.0 1085.0 ARG FRA

Players Dataset¶

In [21]:
#First rows of Players dataset
players.head()
Out[21]:
RoundID MatchID Team Initials Coach Name Line-up Shirt Number Player Name Position Event
0 201 1096 FRA CAUDRON Raoul (FRA) S 0 Alex THEPOT GK NaN
1 201 1096 MEX LUQUE Juan (MEX) S 0 Oscar BONFIGLIO GK NaN
2 201 1096 FRA CAUDRON Raoul (FRA) S 0 Marcel LANGILLER NaN G40'
3 201 1096 MEX LUQUE Juan (MEX) S 0 Juan CARRENO NaN G70'
4 201 1096 FRA CAUDRON Raoul (FRA) S 0 Ernest LIBERATI NaN NaN
In [22]:
#Last rows of Matches dataset 
players.tail()
Out[22]:
RoundID MatchID Team Initials Coach Name Line-up Shirt Number Player Name Position Event
37779 255959 300186501 ARG SABELLA Alejandro (ARG) N 19 ALVAREZ NaN NaN
37780 255959 300186501 GER LOEW Joachim (GER) N 6 KHEDIRA NaN NaN
37781 255959 300186501 ARG SABELLA Alejandro (ARG) N 20 AGUERO NaN IH46' Y65'
37782 255959 300186501 GER LOEW Joachim (GER) N 21 MUSTAFI NaN NaN
37783 255959 300186501 ARG SABELLA Alejandro (ARG) N 23 BASANTA NaN NaN
In [23]:
#NAs in the Players dataset 
missing_values_players = players.isna().sum()
print(missing_values_players)
RoundID              0
MatchID              0
Team Initials        0
Coach Name           0
Line-up              0
Shirt Number         0
Player Name          0
Position         33641
Event            28715
dtype: int64
In [24]:
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
players.columns = players.columns.str.replace(r'\W+', '_', regex=True).str.lower()
players.head()
Out[24]:
roundid matchid team_initials coach_name line_up shirt_number player_name position event
0 201 1096 FRA CAUDRON Raoul (FRA) S 0 Alex THEPOT GK NaN
1 201 1096 MEX LUQUE Juan (MEX) S 0 Oscar BONFIGLIO GK NaN
2 201 1096 FRA CAUDRON Raoul (FRA) S 0 Marcel LANGILLER NaN G40'
3 201 1096 MEX LUQUE Juan (MEX) S 0 Juan CARRENO NaN G70'
4 201 1096 FRA CAUDRON Raoul (FRA) S 0 Ernest LIBERATI NaN NaN

Insights & Takeaways:¶

  • The World Cup dataset provides a snapshot of different tournaments, showcasing trends in goals scored, qualified teams, and attendance.
  • Records with missing 'Year' values were dropped to maintain data integrity.
  • The Matches dataset allowed to identify countries with the most World Cup participations, where Brazil, Italy, and Argentina led the list.
  • Some stadium names had special characters that were not read correctly. Corrections were applied to ensure accurate analysis.
  • The Players dataset offers insights into player positions, coach names, and significant events during matches.
  • Missing values in 'Position' and 'Event' columns were identified, which might suggest that not all players have event data.

ER DIAGRAM (3NF)¶

Retaining the 3 datasets (matches, palyers, world_cup) as separate entities is be a good starting point for an ER diagram in 3N. Each of these entities represents a distinct concept in the World Cup data, and their attributes are well-defined.

The relationships between these entities could be represented as follows:

  • A Match is played in a specific WorldCup (Year).
  • A Player participates in a specific Match (MatchID).
  • A WorldCup has many Matches (Year).

A world cup has many matches (one-to-many relationship) and a Match has many players (many-to-one relationship).

In [25]:
#Import image
display(Image(filename='ER_diagram_FIFA.jpg'))

CREATE DATABASE & TABLES¶

In [26]:
# Establish a connection to the PostgreSQL server
conn = psycopg2.connect(
    host="localhost",
    port='5432',
    database="postgres",
    user="postgres",
    password="123"
)

# Set autocommit to true to create the database
conn.autocommit = True

# Create a new database
db_name = "fifa"
cur = conn.cursor()
cur.execute(f"CREATE DATABASE {db_name}")

print(f"{db_name} database has been successfully created in PostgreSQL.")
fifa database has been successfully created in PostgreSQL.
In [27]:
# Close the connection to the initial 'postgres' database
conn.close()

# Pass the connection string to a variable, conn_url
conn_url = 'postgresql://postgres:123@localhost:5432/fifa'

# Create an engine that connects to the new database
engine = create_engine(conn_url)

# Establish a connection
connection = engine.connect()
In [28]:
# Define SQL commands to create tables (without foreign key constraints)
stmt = (
    """
    CREATE TABLE world_cup (
        year integer PRIMARY KEY, 
        country text, 
        winner text,
        runners_up text,
        third text,
        fourth text,
        goalsscored integer,
        qualifiedteams integer,
        matchesplayed integer,
        attendance integer
    );
    """,
    """
    CREATE TABLE matches (
        matchid integer PRIMARY KEY, 
        year integer,
        roundid integer,
        datetime text, 
        stage text,
        stadium text,
        city text,
        home_team_name text,
        home_team_goals integer,
        away_team_goals integer,
        away_team_name text,
        win_conditions text,
        attendance integer,
        half_time_home_goals integer,
        half_time_away_goals integer,
        referee text,
        assistant_1 text,
        assistant_2 text,
        home_team_initials text,
        away_team_initials text,
        FOREIGN KEY (year) REFERENCES world_cup (year)
    );
    """,
    """
    CREATE TABLE players (
        roundid  integer, 
        matchid integer,
        team_initials text,
        coach_name text,
        line_up text,
        shirt_number integer,
        player_name text,
        position text,
        event text,
        FOREIGN KEY (matchid) REFERENCES matches (matchid) 
    );
    """
)


# Create a connection and cursor
with psycopg2.connect(database=db_name, user="postgres", password="123", host="localhost") as conn:
    with conn.cursor() as cur:
        for command in stmt:
            cur.execute(command)
        
        # commit the changes to the database
        conn.commit()
        
print('The tables have been successfully created in PostgreSQL.')
The tables have been successfully created in PostgreSQL.

LOADING THE DATASET INTO DATABASE (ETL)¶

Populate "world_cup" table¶

In [29]:
#Insert into world_cup table in database
world_cup.to_sql('world_cup', con=engine, index=False, if_exists='append')
Out[29]:
20
In [30]:
#Insert into matches table in database
matches.to_sql('matches', con=engine, index=False, if_exists='append')
Out[30]:
836
In [31]:
#Insert into players table in database
players.to_sql('players', con=engine, index=False, if_exists='append')
Out[31]:
784

ANALYSIS & INSIGHTS (queries)¶

Top 5 countries with the most World Cup wins.¶

In [32]:
# SQL query to find the top 5 countries with the most World Cup wins
top5_wc = """
    SELECT winner, COUNT(*) AS wins
    FROM world_cup
    WHERE winner IS NOT NULL
    GROUP BY winner
    ORDER BY wins DESC
    LIMIT 5;
"""

# Execute the query and fetch the results into a Pandas DataFrame
top5_wc_df = pd.read_sql_query(top5_wc, conn)
print(top5_wc_df)
       winner  wins
0      Brazil     5
1       Italy     4
2  Germany FR     3
3   Argentina     2
4     Uruguay     2
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Teams that have reached the finals the most times.¶

In [33]:
# SQL query to find the teams that have reached the finals the most times.
countries_finals_wc = """
    SELECT 
        team_name,
        COUNT(*) AS finals_appearances
    FROM 
        (
            SELECT 
                winner AS team_name
            FROM 
                world_cup
            UNION ALL
            SELECT 
                runners_up AS team_name
            FROM 
                world_cup
        ) AS finals_teams
    WHERE 
        team_name IS NOT NULL
    GROUP BY 
        team_name
    ORDER BY 
        finals_appearances DESC
    LIMIT 10;
"""

# Execute the query and fetch the results into a Pandas DataFrame
countries_finals_wc_df = pd.read_sql_query(countries_finals_wc, conn)
print(countries_finals_wc_df)
        team_name  finals_appearances
0          Brazil                   7
1      Germany FR                   6
2           Italy                   6
3       Argentina                   5
4     Netherlands                   3
5  Czechoslovakia                   2
6          France                   2
7         Uruguay                   2
8         Germany                   2
9         Hungary                   2
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Number of times Teams have reached top 3 places¶

In [34]:
# SQL query to find the teams that have reached the finals the most times.
finalist_wc = """
SELECT
    country AS team_name,
    SUM(CASE WHEN position = 'Winner' THEN 1 ELSE 0 END) AS Winner,
    SUM(CASE WHEN position = 'Runners-Up' THEN 1 ELSE 0 END) AS Runners_Up,
    SUM(CASE WHEN position = 'Third' THEN 1 ELSE 0 END) AS Third
FROM
    (
        SELECT
            winner AS country,
            'Winner' AS position
        FROM
            world_cup
        UNION ALL
        SELECT
            runners_up AS country,
            'Runners-Up' AS position
        FROM
            world_cup
        UNION ALL
        SELECT
            third AS country,
            'Third' AS position
        FROM
            world_cup
    ) AS teams
WHERE
    country IS NOT NULL
GROUP BY
    country
ORDER BY
    Winner DESC, Runners_Up DESC, Third DESC;

"""

# Execute the query and fetch the results into a Pandas DataFrame
finalist_wc_df = pd.read_sql_query(finalist_wc, conn)
print(finalist_wc_df)
         team_name  winner  runners_up  third
0           Brazil       5           2      2
1            Italy       4           2      1
2       Germany FR       3           3      1
3        Argentina       2           3      0
4          Uruguay       2           0      0
5          Germany       1           1      3
6           France       1           1      2
7            Spain       1           0      0
8          England       1           0      0
9      Netherlands       0           3      1
10         Hungary       0           2      0
11  Czechoslovakia       0           2      0
12          Sweden       0           1      2
13          Poland       0           0      2
14             USA       0           0      1
15          Turkey       0           0      1
16           Chile       0           0      1
17         Austria       0           0      1
18         Croatia       0           0      1
19        Portugal       0           0      1
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Average number of goals scored per World Cup edition.¶

In [35]:
# SQL query to find the average number of goals scored per World Cup edition.
goals_wc = """
    SELECT year, country AS host_country, AVG(goalsscored) AS avg_goals 
    FROM world_cup
    GROUP BY year
    ORDER BY year;
"""

# Execute the query and fetch the results into a Pandas DataFrame
goals_wc_df = pd.read_sql_query(goals_wc, conn)
print(goals_wc_df)
    year  host_country  avg_goals
0   1930       Uruguay       70.0
1   1934         Italy       70.0
2   1938        France       84.0
3   1950        Brazil       88.0
4   1954   Switzerland      140.0
5   1958        Sweden      126.0
6   1962         Chile       89.0
7   1966       England       89.0
8   1970        Mexico       95.0
9   1974       Germany       97.0
10  1978     Argentina      102.0
11  1982         Spain      146.0
12  1986        Mexico      132.0
13  1990         Italy      115.0
14  1994           USA      141.0
15  1998        France      171.0
16  2002   Korea/Japan      161.0
17  2006       Germany      147.0
18  2010  South Africa      145.0
19  2014        Brazil      171.0
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Host country that had the highest percentage of matches won.¶

In [36]:
# SQL query to find the host country that had the highest percentage of matches won.
top_host_wc = """
    SELECT country,
           ROUND(SUM(CASE WHEN winner = home_team_name THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS percentage_won
    FROM world_cup
    JOIN matches ON world_cup.year = matches.year
    GROUP BY country
    ORDER BY percentage_won DESC
    LIMIT 1;
"""

# Execute the query and fetch the results into a Pandas DataFrame
top_host_wc_df = pd.read_sql_query(top_host_wc, conn)
print(top_host_wc_df)
   country  percentage_won
0  Uruguay           22.22
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Correlation between the number of qualified teams and the total goals scored.¶

In [37]:
# SQL query to find the correlation between the number of qualified teams and the total goals scored.
qualified_goals_wc = """
    SELECT CORR(qualifiedteams, goalsscored) AS correlation
    FROM world_cup;
"""

# Execute the query and fetch the results into a Pandas DataFrame
qualified_goals_wc_df = pd.read_sql_query(qualified_goals_wc, conn)
print(qualified_goals_wc_df)
   correlation
0     0.866201
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Average Goals by Stage¶

In [38]:
# SQL query to find the average Goals by Stage
goals_stage_wc = """
    SELECT stage,
           AVG(home_team_goals + away_team_goals) AS avg_goals
    FROM matches
    GROUP BY stage
    ORDER BY avg_goals DESC;
"""

# Execute the query and fetch the results into a Pandas DataFrame
goals_stage_wc_df = pd.read_sql_query(goals_stage_wc, conn)
print(goals_stage_wc_df)
                       stage  avg_goals
0          Preliminary round   5.375000
1                First round   4.888889
2                    Group 6   4.750000
3                Third place   4.500000
4      Match for third place   3.933333
5                Semi-finals   3.617647
6                      Final   3.578947
7                    Group 4   3.200000
8                    Group 2   3.169492
9   Play-off for third place   3.000000
10            Quarter-finals   2.822581
11                   Group B   2.816667
12                   Group 1   2.806452
13                   Group 3   2.750000
14                   Group A   2.666667
15                   Group C   2.625000
16                   Group D   2.562500
17               Round of 16   2.562500
18                   Group E   2.520833
19                   Group H   2.400000
20                   Group G   2.333333
21                   Group F   1.937500
22                   Group 5   1.500000
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Top 10 cities that have hosted the most matches.¶

In [39]:
# SQL query to find the cities that have hosted the most matches.
city_matches_wc = """
    SELECT city, COUNT(*) AS num_matches_hosted
    FROM matches
    GROUP BY city
    ORDER BY num_matches_hosted DESC
    LIMIT 10;
"""

# Execute the query and fetch the results into a Pandas DataFrame
city_matches_wc_df = pd.read_sql_query(city_matches_wc, conn)
print(city_matches_wc_df)
              city  num_matches_hosted
0     Mexico City                   23
1      Montevideo                   18
2     Guadalajara                   17
3  Rio De Janeiro                   15
4    Johannesburg                   15
5    Buenos Aires                   12
6       Sao Paulo                   12
7          Munich                   11
8            Leon                   11
9  Frankfurt/Main                   10
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

World Cup edition with the highest and lowest attendance.¶

In [40]:
# SQL query to find the  World Cup edition with the highest and lowest attendance.
highest_attendance_query = """
    SELECT 
        'Highest Attendance' AS type, 
        wc.year, 
        wc.attendance,
        wc.country AS host_country
    FROM 
        world_cup wc
    ORDER BY 
        wc.attendance DESC
    LIMIT 1;
"""
lowest_attendance_query = """
    SELECT 
        'Lowest Attendance' AS type, 
        wc.year, 
        wc.attendance,
        wc.country AS host_country
    FROM 
        world_cup wc
    WHERE 
        wc.attendance > 0
    ORDER BY 
        wc.attendance
    LIMIT 1;
"""

# Execute the queries and fetch the results into Pandas DataFrames
highest_attendance_df = pd.read_sql_query(highest_attendance_query, conn)
lowest_attendance_df = pd.read_sql_query(lowest_attendance_query, conn)

# Merge the results into a single DataFrame
attendance_df = pd.concat([highest_attendance_df, lowest_attendance_df], ignore_index=True)

# Display the combined result
print(attendance_df)
                 type  year  attendance host_country
0  Highest Attendance  1994     3587538          USA
1   Lowest Attendance  1934      363000        Italy
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Average attendance for matches in each stage.¶

In [41]:
# SQL query to find the average attendance for matches in each stage.
attendance_matches_wc = """
    SELECT stage,
           AVG(attendance) AS avg_attendance
    FROM matches
    GROUP BY stage
    ORDER BY avg_attendance DESC;
"""

# Execute the query and fetch the results into a Pandas DataFrame
attendance_matches_wc_df = pd.read_sql_query(attendance_matches_wc, conn)
print(attendance_matches_wc_df)
                       stage  avg_attendance
0                      Final    76470.263158
1   Play-off for third place    68034.000000
2                    Group 6    65658.583333
3                Semi-finals    58956.235294
4                Third place    57741.500000
5                    Group A    54321.350000
6                Round of 16    51592.666667
7                    Group B    51367.866667
8      Match for third place    50847.866667
9                    Group G    48533.166667
10                   Group H    46747.033333
11                   Group E    45675.916667
12                   Group C    45514.937500
13                   Group D    45427.333333
14            Quarter-finals    44543.903226
15                   Group F    42840.875000
16                   Group 1    41664.064516
17                   Group 5    35354.000000
18                   Group 3    34271.410714
19                   Group 2    34241.762712
20                   Group 4    25915.745455
21         Preliminary round    16875.000000
22               First round    16120.333333
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Attendance vs. goals scored for each match.¶

In [42]:
# SQL query to find the 
matches_goals_wc = """
SELECT
        m.matchid,
        m.year,
        m.stage,
        m.attendance,
        m.home_team_goals AS home_goals,
        m.away_team_goals AS away_goals
    FROM
        matches AS m;
"""

# Execute the query and fetch the results into a Pandas DataFrame
matches_goals_wc_df = pd.read_sql_query(matches_goals_wc, conn)
print(matches_goals_wc_df)
       matchid  year           stage  attendance  home_goals  away_goals
0         1096  1930         Group 1      4444.0           4           1
1         1090  1930         Group 4     18346.0           3           0
2         1093  1930         Group 2     24059.0           2           1
3         1098  1930         Group 3      2549.0           3           1
4         1085  1930         Group 1     23409.0           1           0
..         ...   ...             ...         ...         ...         ...
831  300186504  2014  Quarter-finals     68551.0           1           0
832  300186508  2014     Round of 16     58817.0           2           1
833  300186459  2014     Round of 16     41242.0           1           1
834  300186503  2014     Round of 16     63255.0           1           0
835  300186497  2014     Round of 16     51227.0           2           1

[836 rows x 6 columns]
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Referee who has officiated the most matches.¶

In [43]:
# SQL query to find the referee who has officiated the most matches.
referee_matches_wc = """
    SELECT referee,
           COUNT(*) AS matches_officiated
    FROM matches
    GROUP BY referee
    ORDER BY matches_officiated DESC
    LIMIT 1;
"""

# Execute the query and fetch the results into a Pandas DataFrame
referee_matches_wc_df = pd.read_sql_query(referee_matches_wc, conn)
print(referee_matches_wc_df)
                 referee  matches_officiated
0  Ravshan IRMATOV (UZB)                   9
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Average number of goals scored in matches where Ravshan IRMATOV officiated.¶

In [44]:
# SQL query to find the average number of goals scored in matches where Ravshan IRMATOV  officiated
irmatov_matches_wc = """
SELECT AVG(home_team_goals + away_team_goals) AS avg_goals
FROM matches
WHERE referee = 'Ravshan IRMATOV (UZB)';
"""

# Execute the query and fetch the results into a Pandas DataFrame
irmatov_matches_wc_df = pd.read_sql_query(irmatov_matches_wc, conn)
print(irmatov_matches_wc_df)
   avg_goals
0   2.333333
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Average number of goals scored by home teams and away teams.¶

In [45]:
# SQL query to find the World Cup edition with the highest average goals per match.

most_avg_goals_wc = """
    SELECT year, AVG(home_team_goals + away_team_goals) AS avg_goals_per_match
    FROM matches
    GROUP BY year
    ORDER BY avg_goals_per_match DESC
    LIMIT 1;
"""

# Execute the query and fetch the results into a Pandas DataFrame
most_avg_goals_wc_df = pd.read_sql_query(most_avg_goals_wc, conn)
print(most_avg_goals_wc_df)
   year  avg_goals_per_match
0  1954             5.384615
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

World Cup edition with the highest average goals per match.¶

In [46]:
# SQL query to find the average number of goals scored by home teams and away teams.
goals_HT_AT_wc = """
    SELECT AVG(home_team_goals) AS avg_home_goals,
           AVG(away_team_goals) AS avg_away_goals
    FROM matches;
"""

# Execute the query and fetch the results into a Pandas DataFrame
goals_HT_AT_wc_df = pd.read_sql_query(goals_HT_AT_wc, conn)
print(goals_HT_AT_wc_df)
   avg_home_goals  avg_away_goals
0        1.824163        1.021531
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Players who have served as captains the most number of times.¶

In [47]:
# SQL query to find the top 10 players who have served as captains the most number of times.
top_captains_wc = """
SELECT 
    player_name,
    team_initials AS country,
    COUNT(*) AS captaincy_count
FROM 
    players
WHERE 
    position = 'C'
GROUP BY 
    player_name, team_initials
ORDER BY 
    captaincy_count DESC
LIMIT 10;
"""

# Execute the query and fetch the results into a Pandas DataFrame
top_captains_wc_df = pd.read_sql_query(top_captains_wc, conn)
print(top_captains_wc_df)
         player_name country  captaincy_count
0               LAHM     GER               17
1     Diego MARADONA     ARG               16
2    Kazimierz DEYNA     POL               13
3         R. MARQUEZ     MEX               13
4  Daniel PASSARELLA     ARG               12
5     Ladislav NOVAK     TCH               12
6              MESSI     ARG               12
7         Uwe SEELER     FRG               12
8     Michel PLATINI     FRA               11
9              DUNGA     BRA               11
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

Top 5 players with the most goals across all World Cups.¶

In [48]:
# SQL query to find the Top 5 players with the most goals across all World Cups.
top_players_wc = """
    SELECT 
        p.player_name, 
        m.home_team_name AS player_country,
        COUNT(*) AS goal_count
    FROM 
        players AS p
    JOIN
        matches AS m ON p.matchid = m.matchid
    WHERE 
        p.event LIKE '%G%'
    GROUP BY 
        p.player_name, m.home_team_name
    ORDER BY 
        goal_count DESC
    LIMIT 5;
"""

# Execute the query and fetch the results into a Pandas DataFrame
top_players_wc_df = pd.read_sql_query(top_players_wc, conn)
print(top_players_wc_df)
                          player_name player_country  goal_count
0                             RONALDO         Brazil           8
1  PEL� (Edson Arantes do Nascimento)         Brazil           8
2                          Uwe SEELER     Germany FR           8
3                               KLOSE        Germany           7
4                           JAIRZINHO         Brazil           6
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

World Cup edition with the highest percentage of away team wins.¶

In [49]:
# SQL query to find the World Cup edition with the highest percentage of away team wins.
at_wins_wc = """
    SELECT
        wc.year,
        (COUNT(CASE WHEN m.away_team_goals > m.home_team_goals THEN 1 END)::float / COUNT(*)) * 100 AS away_win_percentage
    FROM
        world_cup wc
    JOIN
        matches m ON wc.year = m.year
    GROUP BY
        wc.year
    ORDER BY
        away_win_percentage DESC
    LIMIT 1;

"""

# Execute the query and fetch the results into a Pandas DataFrame
at_wins_wc_df = pd.read_sql_query(at_wins_wc, conn)
print(at_wins_wc_df)
   year  away_win_percentage
0  2010              39.0625
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(

VISUALIZATION¶

Top 5 Countries with the Most World Cup Wins (up to 2014)¶

In [50]:
fig1 = px.bar(
    top5_wc_df, 
    x='winner', 
    y='wins', 
    title='Top 5 Countries with the Most World Cup Wins (up to 2014)',
    color='winner',  
    color_discrete_sequence=px.colors.qualitative.Prism
)

fig1.update_xaxes(title_text='Country')
fig1.update_yaxes(title_text='Number of Wins')
fig1.show()

Teams That Have Reached the Finals the Most¶

In [51]:
fig2 = px.bar(
    countries_finals_wc_df, 
    x='team_name', 
    y='finals_appearances', 
    title='Teams That Have Reached the Finals the Most Times',
    color='team_name',  # Color bars by team
    color_discrete_sequence=px.colors.qualitative.Prism
)
fig2.update_xaxes(title_text='Team')
fig2.update_yaxes(title_text='Finals Appearances')
fig2.show()

Number of Times Teams Have Reached the Top 3¶

In [52]:
fig3 = px.bar(
    finalist_wc_df, 
    x='team_name', 
    y=['winner', 'runners_up', 'third'], 
    title='Number of Times Teams Have Reached the Top 3 Places',
    labels={'value': 'Count', 'variable': 'Position'},
    color_discrete_sequence=px.colors.qualitative.Prism  
)
fig3.update_xaxes(title_text='Team')
fig3.update_yaxes(title_text='Count')
fig3.show()

Average Goals Scored per World Cup Edition¶

In [53]:
fig4 = px.line(
    goals_wc_df, 
    x='year', 
    y='avg_goals', 
    title='Average Goals Scored per World Cup Edition',
    color_discrete_sequence=px.colors.qualitative.Prism,  
    hover_name='host_country' ,  
    labels={'avg_goals': 'Average Goals'} 
)
fig4.update_xaxes(title_text='Year')
fig4.update_yaxes(title_text='Average Goals')
fig4.show()

Top 5 Players with the Most Goals Across WC¶

In [54]:
fig5 = px.bar(
    top_players_wc_df, 
    x='player_name', 
    y='goal_count', 
    title='Top 5 Players with the Most Goals Across All World Cups',
    color='player_country', 
    color_discrete_sequence=px.colors.qualitative.Prism  
)
fig5.update_xaxes(title_text='Player')
fig5.update_yaxes(title_text='Goal Count')
fig5.show()

WC Edition with Highest and Lowest Attendance¶

In [55]:
fig6 = px.bar(
    attendance_df, 
    x='year', 
    y='attendance', 
    color='type',  # Color bars by type (highest/lowest)
    title='World Cup Edition with Highest and Lowest Attendance',
    color_discrete_sequence=px.colors.qualitative.Prism,
    hover_data=['host_country'], 
    labels={'host_country': 'country'} 

)
fig6.update_xaxes(title_text='Year')
fig6.update_yaxes(title_text='Attendance')
fig6.show()

Average Attendance for Matches in Each Stage¶

In [56]:
fig7 = px.bar(
    attendance_matches_wc_df, 
    x='stage', 
    y='avg_attendance', 
    title='Average Attendance for Matches in Each Stage',
    color='stage',  # Color bars by stage
    color_discrete_sequence=px.colors.qualitative.Prism
)
fig7.update_xaxes(title_text='Stage')
fig7.update_yaxes(title_text='Average Attendance')
fig7.show()

Players Who Have Served as Captains the Most Number of Times¶

In [57]:
fig8 = px.bar(
    top_captains_wc_df, 
    x='player_name', 
    y='captaincy_count', 
    title='Players Who Have Served as Captains the Most Number of Times',
    color='country',  
    color_discrete_sequence=px.colors.qualitative.Prism, 

)
fig8.update_xaxes(title_text='Player')
fig8.update_yaxes(title_text='Captaincy Count')
fig8.show()

Top 10 Cities That Have Hosted the Most Matches¶

In [58]:
fig9 = px.bar(
    city_matches_wc_df, 
    x='city', 
    y='num_matches_hosted', 
    title='Top 10 Cities That Have Hosted the Most Matches',
    color='city',  
    color_discrete_sequence=px.colors.qualitative.Prism)

fig9.update_xaxes(title_text='City')
fig9.update_yaxes(title_text='Number of Matches Hosted')
fig9.show()

Attendance vs. Total Goals Scored for Each Match¶

In [59]:
# Add a new column 'total_goals' which is the sum of home and away goals
matches_goals_wc_df['total_goals'] = matches_goals_wc_df['home_goals'] + matches_goals_wc_df['away_goals']

# Create the scatter plot with the new 'total_goals' column and Prism color palette
fig10 = px.scatter(matches_goals_wc_df, x="attendance", y="total_goals",
                 title="Attendance vs. Total Goals Scored for Each Match",
                 color_discrete_sequence=px.colors.qualitative.Prism)

# Customize the layout
fig10.update_layout(
    xaxis_title="Attendance",
    yaxis_title="Total Goals Scored",
    hovermode="closest"
)

# Show the plot
fig10.show()
In [60]:
#Export some of the graphs (for blog post) 

# Export fig1 as HTML
pio.write_html(fig1, file='fig1.html', auto_open=False)

# Export fig3 as HTML
pio.write_html(fig3, file='fig3.html', auto_open=False)

# Export fig4 as HTML
pio.write_html(fig4, file='fig4.html', auto_open=False)

# Export fig5 as HTML
pio.write_html(fig5, file='fig5.html', auto_open=False)

# Export fig8 as HTML
pio.write_html(fig8, file='fig8.html', auto_open=False)

# Export fig9 as HTML
pio.write_html(fig9, file='fig9.html', auto_open=False)

CONCLUSION¶

The FIFA World Cup, a pinnacle of international football, has been a spectacle since its inception in 1930. Delving into a comprehensive dataset of World Cup history (until 2014), the following insights were uncovered:

  1. Brazil Dominates the World Cup: Brazil is the undisputed leader when it comes to World Cup wins. They have won the tournament a remarkable five times, more than any other country. This highlights Brazil's dominance in international football.
  2. Strong Finalists: Teams like Brazil, Italy, Argentina, and Germany have consistently reached the finals, demonstrating their prowess in the tournament. These teams are the ones to watch out for in any World Cup edition.
  3. Goals: The average number of goals scored per World Cup edition has increased over the years.
  4. A Stage-by-Stage Story:The World Cup unfolds in stages, each with its unique drama. The preliminary and first rounds feature the highest average goals per match, a testament to the unpredictability and flair that characterize the early stages.
  5. Icononic host cities: Cities like Mexico City, Montevideo, and Guadalajara have hosted the most matches. These cities have been at the heart of World Cup history, hosting numerous memorable matches.
  6. Captains of Leadership: Philipp Lahm, Diego Maradona, and Kazimierz Deyna are among the players who have served as captains the most times. They are not only great players but also exceptional leaders on the field.

In the end, the FIFA World Cup is not just a sporting event; it's a story of nations, passion, and the beautiful game. Each statistic and record is a chapter in this enthralling narrative, weaving a tapestry of football's most illustrious competition. As the world unites every four years to celebrate football, these insights remind us of the rich heritage and unpredictability that make the World Cup a timeless spectacle.

CLOSE CONNECTION¶

Close connection and engine to avoid resource leaks.

In [61]:
# Close the connection
connection.close()

# Close the engine
engine.dispose()